
perf: Add mergeHistogram fast path #66

Closed
wants to merge 9 commits

Conversation

@carsonip carsonip commented Aug 2, 2023

Add a fast path for mergeHistogram since it accounts for the most time in CPU profiles. This optimizes for cases where len(from.Buckets) << len(to.Buckets), so that we get O(m lg n) as opposed to the regular O(n + m), where m = len(from.Buckets) and n = len(to.Buckets). This idea comes from #49 but was not implemented in the merged #52.
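
For illustration, the fast path boils down to binary-searching each from bucket into to. Below is a simplified, self-contained sketch (the histogram type and the mergeFastPath name are hypothetical, not this PR's code; the actual change counts missing buckets as extra so it can grow to in one allocation instead of bailing out):

package main

import (
	"fmt"
	"sort"
)

// Simplified stand-in for the real histogram representation: sorted bucket
// keys with a parallel slice of counts.
type histogram struct {
	Buckets []int32
	Counts  []int64
}

// mergeFastPath folds `from` into `to` in O(m lg n) time when every bucket of
// `from` already exists in `to` (m = len(from.Buckets), n = len(to.Buckets)).
// It returns false without touching `to` when some bucket is missing, so the
// caller can fall back to the regular O(n + m) merge.
func mergeFastPath(to, from *histogram) bool {
	if len(from.Buckets) >= len(to.Buckets) {
		return false // heuristic: the fast path only pays off when m << n
	}
	idxs := make([]int, len(from.Buckets))
	for i, b := range from.Buckets {
		j, found := sort.Find(len(to.Buckets), func(k int) int {
			return int(b - to.Buckets[k])
		})
		if !found {
			return false // at least one bucket is missing; use the slow path
		}
		idxs[i] = j
	}
	for i, j := range idxs {
		to.Counts[j] += from.Counts[i]
	}
	return true
}

func main() {
	to := &histogram{Buckets: []int32{1, 3, 5, 7, 9}, Counts: []int64{1, 1, 1, 1, 1}}
	from := &histogram{Buckets: []int32{3, 7}, Counts: []int64{2, 4}}
	fmt.Println(mergeFastPath(to, from), to.Counts) // true [1 3 1 5 1]
}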

benchstat:

goos: linux
goarch: amd64
pkg: github.com/elastic/apm-aggregation/aggregators
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
                                       │ bench-out/merge-histogram-fastpath-7-before │ bench-out/merge-histogram-fastpath-7-after │
                                       │                   sec/op                    │       sec/op         vs base               │
AggregateCombinedMetrics-16                                              1.893µ ± 6%           1.878µ ± 3%        ~ (p=0.310 n=6)
AggregateBatchSerial-16                                                  4.469µ ± 5%           4.407µ ± 4%        ~ (p=0.394 n=6)
AggregateBatchParallel-16                                                4.968µ ± 6%           4.957µ ± 2%        ~ (p=0.669 n=6)
CombinedMetricsEncoding-16                                               907.6n ± 5%           902.3n ± 3%        ~ (p=0.818 n=6)
CombinedMetricsToBatch-16                                                60.76µ ± 2%           62.81µ ± 2%   +3.38% (p=0.002 n=6)
EventToCombinedMetrics-16                                                398.3n ± 5%           384.2n ± 2%   -3.54% (p=0.002 n=6)
NDJSONSerial/go-2.0.0.ndjson-16                                          16.67m ± 2%           15.33m ± 5%   -8.00% (p=0.002 n=6)
NDJSONSerial/nodejs-3.29.0.ndjson-16                                     9.418m ± 4%           8.933m ± 3%   -5.15% (p=0.004 n=6)
NDJSONSerial/python-6.7.2.ndjson-16                                      23.54m ± 3%           23.39m ± 8%        ~ (p=0.093 n=6)
NDJSONSerial/ruby-4.5.0.ndjson-16                                        13.81m ± 2%           13.31m ± 5%   -3.62% (p=0.041 n=6)
NDJSONParallel/go-2.0.0.ndjson-16                                        16.87m ± 4%           15.02m ± 4%  -10.93% (p=0.002 n=6)
NDJSONParallel/nodejs-3.29.0.ndjson-16                                   9.521m ± 9%           8.890m ± 5%   -6.62% (p=0.002 n=6)
NDJSONParallel/python-6.7.2.ndjson-16                                    23.20m ± 2%           23.00m ± 1%        ~ (p=0.394 n=6)
NDJSONParallel/ruby-4.5.0.ndjson-16                                      14.12m ± 3%           13.28m ± 3%   -5.94% (p=0.002 n=6)
geomean                                                                  397.7µ                384.7µ        -3.28%

                                       │ bench-out/merge-histogram-fastpath-7-before │ bench-out/merge-histogram-fastpath-7-after │
                                       │                    B/op                     │        B/op         vs base                │
AggregateCombinedMetrics-16                                           1.103Ki ± 1%           1.104Ki ± 1%       ~ (p=0.859 n=6)
AggregateBatchSerial-16                                               1.368Ki ± 2%           1.362Ki ± 2%       ~ (p=0.180 n=6)
AggregateBatchParallel-16                                             1.353Ki ± 2%           1.372Ki ± 1%       ~ (p=0.234 n=6)
CombinedMetricsEncoding-16                                              0.000 ± 0%             0.000 ± 0%       ~ (p=1.000 n=6) ¹
CombinedMetricsToBatch-16                                             3.983Ki ± 0%           3.983Ki ± 0%       ~ (p=0.472 n=6)
EventToCombinedMetrics-16                                               0.000 ± 0%             0.000 ± 0%       ~ (p=1.000 n=6) ¹
NDJSONSerial/go-2.0.0.ndjson-16                                       9.237Mi ± 1%           9.219Mi ± 1%       ~ (p=0.394 n=6)
NDJSONSerial/nodejs-3.29.0.ndjson-16                                  5.564Mi ± 1%           5.576Mi ± 1%       ~ (p=0.818 n=6)
NDJSONSerial/python-6.7.2.ndjson-16                                   15.35Mi ± 1%           15.33Mi ± 0%       ~ (p=0.310 n=6)
NDJSONSerial/ruby-4.5.0.ndjson-16                                     8.254Mi ± 0%           8.249Mi ± 0%       ~ (p=0.589 n=6)
NDJSONParallel/go-2.0.0.ndjson-16                                     9.251Mi ± 0%           9.242Mi ± 1%       ~ (p=0.180 n=6)
NDJSONParallel/nodejs-3.29.0.ndjson-16                                5.562Mi ± 1%           5.599Mi ± 1%       ~ (p=0.132 n=6)
NDJSONParallel/python-6.7.2.ndjson-16                                 15.33Mi ± 0%           15.39Mi ± 0%       ~ (p=0.132 n=6)
NDJSONParallel/ruby-4.5.0.ndjson-16                                   8.248Mi ± 0%           8.249Mi ± 0%       ~ (p=0.589 n=6)
geomean                                                                            ²                       +0.14%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                       │ bench-out/merge-histogram-fastpath-7-before │ bench-out/merge-histogram-fastpath-7-after │
                                       │                  allocs/op                  │     allocs/op       vs base                │
AggregateCombinedMetrics-16                                             17.00 ± 0%             17.00 ± 0%       ~ (p=1.000 n=6) ¹
AggregateBatchSerial-16                                                 13.00 ± 0%             13.00 ± 0%       ~ (p=1.000 n=6) ¹
AggregateBatchParallel-16                                               13.00 ± 0%             13.00 ± 8%       ~ (p=0.455 n=6)
CombinedMetricsEncoding-16                                              0.000 ± 0%             0.000 ± 0%       ~ (p=1.000 n=6) ¹
CombinedMetricsToBatch-16                                               95.00 ± 0%             95.00 ± 0%       ~ (p=1.000 n=6) ¹
EventToCombinedMetrics-16                                               0.000 ± 0%             0.000 ± 0%       ~ (p=1.000 n=6) ¹
NDJSONSerial/go-2.0.0.ndjson-16                                        114.0k ± 3%            113.9k ± 2%       ~ (p=1.000 n=6)
NDJSONSerial/nodejs-3.29.0.ndjson-16                                   63.18k ± 1%            63.86k ± 1%       ~ (p=0.310 n=6)
NDJSONSerial/python-6.7.2.ndjson-16                                    162.7k ± 1%            163.3k ± 0%       ~ (p=0.132 n=6)
NDJSONSerial/ruby-4.5.0.ndjson-16                                      98.75k ± 0%            98.77k ± 1%       ~ (p=0.937 n=6)
NDJSONParallel/go-2.0.0.ndjson-16                                      113.8k ± 2%            114.2k ± 1%       ~ (p=0.394 n=6)
NDJSONParallel/nodejs-3.29.0.ndjson-16                                 63.45k ± 1%            64.15k ± 1%  +1.10% (p=0.015 n=6)
NDJSONParallel/python-6.7.2.ndjson-16                                  163.1k ± 1%            162.3k ± 1%       ~ (p=1.000 n=6)
NDJSONParallel/ruby-4.5.0.ndjson-16                                    98.78k ± 0%            98.88k ± 1%       ~ (p=0.485 n=6)
geomean                                                                            ²                       +0.18%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

@carsonip carsonip requested a review from a team as a code owner August 2, 2023 11:47
fromIdx++
var extra int
toLen, fromLen := len(to.Buckets), len(from.Buckets)
if fromLen < toLen { // Heuristics to trade between O(m lg n) and O(n + m).
Member Author

[For reviewer] We can change this to if fromLen < toLen / 2 or any other heuristic, but I assume this is good enough to rule out the edge case where it is 100% impossible to merge without additional slots.

Member Author

Also, the if fromLen < toLen / 2 variant isn't faster in benchmarks.

@carsonip carsonip requested a review from lahsivjar August 2, 2023 11:51
Comment on lines 406 to 416
for fromIdx := 0; fromIdx < fromLen; fromIdx++ {
	toIdx, found := sort.Find(toLen, func(toIdx int) int {
		return int(from.Buckets[fromIdx] - to.Buckets[toIdx])
	})
	if found {
		to.Counts[toIdx] += from.Counts[fromIdx]
		from.Counts[fromIdx] = 0
	} else {
		extra++
	}
}
Contributor

Suggested change
findIn := to
for fromIdx := 0; fromIdx < fromLen; fromIdx++ {
	target := from.Buckets[fromIdx]
	toIdx, found := sort.Find(len(findIn), func(toIdx int) int {
		return int(target - findIn.Buckets[toIdx])
	})
	if toIdx == len(findIn) {
		// every bucket in `findIn` is less than target, so no later
		// (larger) target can be found in `findIn` either
		break
	}
	if found {
		findIn.Counts[toIdx] += from.Counts[fromIdx]
		from.Counts[fromIdx] = 0
	} else {
		extra++
	}
	findIn = findIn[toIdx:] // next target will definitely be greater than the current one
}

We can optimize the search a bit more by taking advantage of from also being sorted. I added sample code above, but please double-check it as I wrote it in a hurry to suggest the idea.

Member Author

oh this is clever

Member Author

@carsonip carsonip Aug 2, 2023

Actually, I have an idea that is applicable to more cases. Instead of searching in 0..toLen each time, we can use the toIdx from the previous iteration and search in 0..toIdx+1. This will require reversing this loop: for fromIdx := 0; fromIdx < fromLen; fromIdx++ (i.e. iterating fromIdx from fromLen-1 down to 0).
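
Roughly, it would look like this (untested sketch, reusing to, from, toLen, fromLen, and extra from the diff above):

// Untested sketch: walk `from` from its largest bucket downwards and reuse
// the previous search result as the upper bound of the next search.
hi := toLen
for fromIdx := fromLen - 1; fromIdx >= 0; fromIdx-- {
	target := from.Buckets[fromIdx]
	toIdx, found := sort.Find(hi, func(i int) int {
		return int(target - to.Buckets[i])
	})
	if found {
		to.Counts[toIdx] += from.Counts[fromIdx]
		from.Counts[fromIdx] = 0
	} else {
		extra++
	}
	// the next (smaller) target can only land at or before toIdx
	if hi = toIdx + 1; hi > toLen {
		hi = toLen
	}
}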

Member Author

Moreover, in your code we must ensure extra > 0 when we break, and we'll also have to correctly determine extra. These short-circuits are dangerous 😢

			if toIdx == len(findIn) {
				break
			}

Member Author

Updated the PR to limit the search space with the help of the result of the previous iteration.

Contributor

Looking at this again, I think it might be better to just optimise the merge for exactly the same buckets in to and from; if any of the buckets differ, we fall back to the previous logic. I think it will be simpler and give us the same benefits with slightly better performance. WDYT?

Member Author

@carsonip carsonip Aug 2, 2023

if any of the buckets differ, we fall back to the previous logic. I think it will be simpler and give us the same benefits with slightly better performance.

Yes, we can fall back to the previous logic, but we'll still have to pick between O(n+m) and O(m lg n) for counting the extra buckets, so we can grow the slice in one go. Since in a lot of cases m << n, having O(m lg n) just for counting may still be faster than O(n+m). IMO the performance isn't that clear cut. Am I missing anything?

Member Author

@carsonip carsonip Aug 2, 2023

Ah right, since the merge is O(n+m), the runtime of the whole function where we do binary search without falling back immediately becomes max(O(m lg n), O(n+m)). Yes, in theory, if we fall back early there is slightly better performance. I'll see how I can incorporate this.
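
A rough sketch of that structure (hypothetical, untested; inside the merge function, before the existing logic):

// If the bucket layouts are identical, merge counts elementwise in O(n);
// otherwise fall back to the existing merge.
sameBuckets := len(from.Buckets) == len(to.Buckets)
for i := 0; sameBuckets && i < len(from.Buckets); i++ {
	sameBuckets = from.Buckets[i] == to.Buckets[i]
}
if sameBuckets {
	for i, c := range from.Counts {
		to.Counts[i] += c
	}
	return
}
// fall back to the previous O(n + m) merge below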


carsonip commented Aug 2, 2023

Closing this in favor of #69

@carsonip carsonip closed this Aug 2, 2023